The goal of this analysis is to identify synthetic lethal interactions with KRAS G13D mutations.
The data is from the genome-wide CRISPR-Cas9 loss-of-function screen named “Achilles” from the DepMap Project. This data was downloaded, cleaned, and made available in the ‘tidy_Achilles’ GitHub repository.
The following plot shows the number of cell lines in CCLE with either a KRAS or NRAS mutation (only showing alleles found in at least two cell lines of a disease).
However, the DepMap has yet to screen all of the CCLE cell lines. The following only includes the cell lines used in Achilles.
ids_screened <- readRDS(file.path("data", "Achilles_gene_effect.tib")) %>%
pull(dep_map_id) %>%
unlist() %>%
unique()
As of the 2019Q2 release, there are 563 screened cell lines. The most frequent allele is KRAS G12D, and the organs with the most frequent KRAS mutants are colorectal, lung, and pancreas. The following table shows the number of cell lines with each KRAS allele.
readRDS(file.path("data", "ras_mutants_info.tib")) %>%
filter(dep_map_id %in% !!ids_screened & ras == "KRAS") %>%
count(ras_allele, disease) %>%
filter(n >= 2) %>%
mutate(disease = str_to_title(str_replace_all(disease, "_", " ")),
ras_allele = str_replace_all(ras_allele, "_", " ")) %>%
arrange(disease, desc(n)) %>%
dplyr::rename(`RAS allele` = "ras_allele",
`Origin of cell line` = "disease",
`num. of cell lines` = "n")
Below are the distributions of the depletion scores in the three organs with the highest frequency of KRAS mutations.
They are not normally distributed. Instead, most of the values lie near 0, indicating that most knock-out events had little impact on the viability of the cell lines (as expected). There is a tail to the left filled with the genes that did have an effect when knocked-out.
(Analysis conducted in subscripts/linear_model.R.)
For all of the following analyses, these two conditions were met:
The first attempt at modeling the data used a standard linear model to estimate the depletion effect given the KRAS allele, grouped as either WT, KRAS G12, or KRAS G13D.
The volcano plot below shows the difference in the estimates for KRAS G12 and KRAS G13D and the log-transformed BH FDR-adjusted p-value (q-value) of the model. The labeled genes had a model q-value less than 0.20 and a difference in estimate of magnitude greater than 0.2. The points to the left (blue) had a stronger depletion effect in KRAS G12 mutants where the points to the right (red) had a stronger depletion effect in KRAS G13D.
The genes with very low model q-values but little difference between KRAS G12 and G13D estimates (middle of the x-axis and high on the y-axis) were likely highly predicted by the WT covariate (i.e. the gene effect did not change much according to the KRAS allele).
The plot below compares the coefficients fit to KRAS G12 (x-axis) and G13D (y-axis). Values that lie along the dashed line showed no difference in depletion effects between the two mutant KRAS groups. The highlighted genes had significant models (q-value < 0.20) and a difference in coefficients of 0.2 in magnitude.
The plot shows target genes that had a significantly stronger depletion effect in KRAS G13D cell lines. These genes showed increased depletion in KRAS G13D cell lines, had a model q-value below 0.20, and a p-value for the G13D covariate below 0.05.
On the other hand, the following plot shows target genes that had a significantly weaker/reduced depletion effect (relative to the other alleles tested) in KRAS G13D cell lines.
The next attempt at modeling the data used a standard linear model to estimate the depletion effect given the KRAS allele and mutation status of the target gene. The model had two covariates, KRAS allele and the mutational status of the target gene (binary). The alleles were grouped as KRAS codon 12, KRAS G13D, or WT. Only genes that caused a depletion to -0.25 or lower in at least one cell line were used.
The volcano plot below shows the difference in the estimates for KRAS G12 and KRAS G13D and the log-transformed BH FDR-adjusted p-value (q-value) of the model. The labeled genes had a model q-value less than 0.20 and a difference in estimate of magnitude greater than 0.2. The points to the left (blue) had a stronger depletion effect in KRAS G12 mutants where the points to the right (red) had a stronger depletion effect in KRAS G13D.
The genes with very low model q-values but little difference between KRAS G12 and G13D estimates (middle of the x-axis and high on the y-axis) were likely highly predicted by the WT covariate (i.e. the gene effect did not change much according to the KRAS allele) or mutational status.
The plot shows target genes that had a significantly stronger depletion effect in KRAS G13D cell lines. These genes showed increased depletion in KRAS G13D cell lines, had a model q-value below 0.20, and a p-value for the G13D covariate below 0.05.
On the other hand, the following plot shows target genes that had a significantly weaker/reduced depletion effect (relative to the other alleles tested) in KRAS G13D cell lines.
I have additionally fit the same model without WT samples, thus the intercept was the KRAS G12 effect and other covariates were KRAS G13D and mutational status of the target. Looking at the target genes that could be modeled and how their results compare to the previous version of the model (with WT), they do not capture any new effects. I believe including the WT as the intercept is a logical inclusion.
I ran the same model as before, including KRAS WT, G12, and G13D and whether the target gene was mutated or not, now including the RNA expression levels of the target gene in each cell line.
The following plot is a volcano with the KRAS G13D effect on the x-axis and log-transformed p-value of the KRAS G13D coefficient on the y-axis. The highlighted genes had an overall model q-value below 0.20, KRAS G13D p-value below 0.05, and a KRAS G13D estimate of magnitude greater than 0.20.
The volcano plot below shows the difference in the estimates for KRAS G12 and KRAS G13D and the log-transformed BH FDR-adjusted p-value (q-value) of the model. The labeled genes had a model q-value less than 0.20 and a difference in estimate of magnitude greater than 0.2. The points to the left (blue) had a stronger depletion effect in KRAS G12 mutants where the points to the right (red) had a stronger depletion effect in KRAS G13D.
The plot below compares the coefficients fit to KRAS G12 (x-axis) and G13D (y-axis). Values that lie along the dashed line showed no difference in depletion effects between the two mutant KRAS groups. The highlighted genes had significant models (q-value < 0.20) and a difference in coefficients of 0.2 in magnitude.
The plot shows target genes that had a significantly stronger depletion effect in KRAS G13D cell lines. These genes showed increased depletion in KRAS G13D cell lines, had a model q-value below 0.20, and a p-value for the G13D covariate below 0.05.
The plot shows target genes that had a significantly reduced depletion effect in KRAS G13D cell lines. These genes showed reduced depletion in KRAS G13D cell lines, had a model q-value below 0.20, and a p-value for the G13D covariate below 0.05.
The following plots show the trend of gene expression with depletion score for models found to have a significant coefficient and estimate (slope) of at least 0.15 in magnitude.
Inspecting the target genes that had statistically significant coefficients for the mutational status covariate (non-synonymous mutations only) revealed that many models were significant with only one or two cell lines with a mutation. The plot below shows the models with significant mutational status coefficients with an effect size of at least 0.15 in magnitude and at least four cell lines with a mutation in the gene.
From the previous observation that some models were fitting significant and relatively large coefficients to the mutational status covariate when only couple of cell lines had a mutation, I restricted the inclusion of this covariate to models of genes mutated in at least four cell lines.
The following plots have the same layout as in previous models, so are not specifically explained in each tab.
Instead of including all codon 12 KRAS mutants in one group, the following model is the same as used in the previous section, though only using KRAS G12D. This may reduce variability of the codon 12 group if the various allele that constitute it are quite different.
I looked at the frequency of co-mutation of the genes identified by the linear model with KRAS G12, KRAS G13D, and WT in human tumor samples from COAD, LUAD, and PAAD. Below is a heatmap colored by co-mutation frequency. The numbers in the cells indicate the number of co-mutation events.
The following heatmap has the same information, however the fill color indicates the co-mutation frequency scaled from 0 to 1 for each gene individually. This gives an indication of the relative rates of co-mutation for each gene.
Below are the same plots, but the columns (target genes) are hierarchically clustered.
With the co-mutation frequencies rescaled within each gene.
I used the ‘enrichR’ package to identify functionalities enriched in the genes that had G13D-specific genetic dependencies. The bar-plot below shows the identified functions along the y-axis and the corresponding p-value along the x-axis. The fraction at the end of each bar is the fraction of genes associated with the term that are in the list. The genes along the bar are those that are associated with the term.
I extracted the subnetwork of the protein-protein interaction network (source: High-quality INTeractomes database; nodes = genes, edges = physical interactions) that contained most of the genes that had G13D-specific genetic dependencies and any other nodes that are connected to at least three of these genes. The two plots below show the same nodes, but the second only includes the physical interactions with the genes with genetic effects.
The shade of the edge indicates the how many of the nodes it connects were from the genetic dependency analysis (either “both”, “one”, or “neither”). The color of the node indicates the coefficient; a positive coefficient (red) indicates that the targeting of the gene was predictive of reduced lethality, and blue the opposite. The size of the node indicates its centrality in the network (an indicator of importance).
The following plot of the subnetwork of the PPI shows only the nodes of genes identified to have a G13D genetic dependency.
The subnetwork can be dissected by clustering.
The degree of a node (number of neighbors) can be another indication of importance. The following figure plots the degree of each gene in the subnetwork against the KRAS G13D coefficient from the modeling of genetic dependence.
I compared the genes identified in this analysis to those identified by Alex from UCSF. There was no overlap between the genes identified by differential gene expression (DGE; q-value 0.05, log-fold change > 2) and modeling G13D-specific dependencies (model q-value < 0.2, coefficient p-value < 0.05, coefficient of magnitude > 0.15).
The following figure is the volcano plot of the G13D coefficient against its p-value. The blue points are genes significantly differentially expressed in any of the mouse model DGE comparisons. The top selection of those are labeled.
I conducted a similar extraction of the PPI subnetwork for the differentially expressed genes (DEG).
I merged this with the same network from the G13D-dependent genes.
The plot below shows the same subnetwork, just KRAS is placed at the center and the other nodes are located in rings around KRAS based on their distance to (number of edges between) KRAS.
However, if I only use the differentially expressed genes from the G12D vs G13D comparison, no subnetwork can be built - none of the nodes are connected by bridging nodes like was seen with the genes from the dependency analysis. Some of the genes do fall within the subnetwork extracted from the dependency analysis. The plot below shows the PPI subnetwork comprised of the genes with G13D dependencies (triangle), those found to be differentially expressed between KRAS G12D and G13D (squares), and any “bridge” node that is connected to at least 3 of those genes (grey circles).
The following plot is of the same subnetwork, but clustered by connectivity. The edges between “bridge” nodes have been removed.
Instead of using the mutational status of KRAS to explain the depletion effect of each targeted gene in a separate model, I tried to predict the status of KRAS using all of the depletion scores in one model.
I tried LASSO and Ridge for penalized regression and found that the strong effects of LASSO were preferable due the high dimensionality of the problem. I began by just using LASSO to predict whether a cell line would have WT or mutant KRAS. Below are the genes with non-zero effects (not including the effect of knocking-out KRAS).
The following box-plots show the depletion effects in mutant and WT KRAS cell lines for the identified genes.
For comparison, I ran a standard linear model for each gene using KRAS as the only predictor, limited to WT or mutated. The following volcano plot shows where the LASSO-identified genes fall in the spectrum of results obtained from this modeling. None met the criteria used in the previous section of this analysis: model q-value < 0.20, coefficient p-value < 0.05, and magnitude of coeffcient > 0.15.
I tried using a random forest classifier to classify whether KRAS was WT or mutated in a cell line based off of the depletion effects of all genes, but this had poor accuracy (around 50%).
However, if only the genes from the LASSO-regularized regression are used, the training error rate drops to 6.25% and the accuracy with the test data is 1.
Using machine learning algorithms requires the split of the data into testing and training data sets. Due to the relatively low number of KRAS G13D samples, these methods are not applicable for an allele-specific analysis.
Still, the LASSO-regularized linear regression on KRAS mutational status was interesting and may have identified a few avenues of further exploration.